Add managed-memory advise, prefetch, and discard-prefetch free functions#1775
rparolin wants to merge 74 commits into NVIDIA:main
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

/ok to test
question: Does making these member functions of the
I'm moving this back into draft. We discussed this in our team meeting; I was already hesitant because Buffer is becoming a 'God object' with the functionality it is gaining. We were going to explore alternatives. Free functions sound like a good alternative to explore.
…ns in the cuda.core.managed_memory namespace
…ups, fix docs

- Remove duplicate long-form "cu_mem_advise_*" string aliases from _MANAGED_ADVICE_ALIASES; users pass short strings or the enum directly
- Replace 4 boolean allow_* params in _normalize_managed_location with a single allowed_loctypes frozenset driven by _MANAGED_ADVICE_ALLOWED_LOCTYPES
- Cache immutable runtime checks: CU_DEVICE_CPU, v2 bindings flag, discard_prefetch support, and advice enum-to-alias reverse map
- Collapse hasattr+getattr to single getattr in _managed_location_enum
- Move _require_managed_discard_prefetch_support to top of discard_prefetch for fail-fast behavior
- Fix docs build: reset Sphinx module scope after managed_memory section in api.rst so subsequent sections resolve under cuda.core
- Add discard_prefetch pool-allocation test and comment on _get_mem_range_attr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e legacy path The _V2_BINDINGS cache in _buffer.pyx persists across tests, so monkeypatching get_binding_version alone is insufficient when earlier tests have already populated the cache with the v2 value. Promote _V2_BINDINGS from cdef int to a Python-level variable so tests can monkeypatch it directly via monkeypatch.setattr, and reset it to -1 in both legacy-signature tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t real hardware These three tests call cuMemAdvise on real CUDA devices and verify memory range attributes. On devices without concurrent_managed_access (e.g. Windows/WDDM), set_read_mostly silently no-ops and set_preferred_location fails with CUDA_ERROR_INVALID_DEVICE. Use the stricter _skip_if_managed_location_ops_unsupported guard, matching the pattern already used by test_managed_memory_functions_accept_raw_pointer_ranges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s support Reorder checks in discard_prefetch so _normalize_managed_target_range runs before _require_managed_discard_prefetch_support. This ensures non-managed buffers raise ValueError before the RuntimeError for missing cuMemDiscardAndPrefetchBatchAsync support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ps module Move advise, prefetch, and discard_prefetch functions and their helpers out of _buffer.pyx into a new _managed_memory_ops Cython module to improve separation of concerns. Expose _init_mem_attrs and _query_memory_attrs as non-inline cdef functions in _buffer.pxd so the new module can reuse them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
api.rst still listed the single-buffer free functions and *Options dataclasses that were removed under R9/R11 (advise, prefetch, discard, discard_prefetch and their *Options classes). Replace with the actual cuda.core.utils exports: prefetch_batch, discard_batch, discard_prefetch_batch. Drop the now-orphan :template: dataclass.rst line.
Retraction on issue 2 above. After verification: issue 2 (cu12 ELSE branches calling broken cydriver signatures) is a false positive. CI builds cu12 cuda_core against cuda_bindings artifacts from the `12.9.x` backport branch (per `build-wheel.yml` and `ci/versions.yml`), and that branch's `cydriver.pxd.in` still exposes the v1 signatures. Those match the 4-arg int-device calls in `_managed_memory_ops.pyx` ELSE branches. The mismatch I flagged is only against `main`'s pxd.in, which post-cu13-cutover has been narrowed to v2 wrappers; that's not what cu12 cuda_core compiles against. Sorry for the noise. Issue 1 (api.rst phantom symbols) still stands and is fixed in 4c228eb.
Mirror Device's singleton semantics so Host() is Host() and Host(numa_id=1) is Host(numa_id=1) hold. Host.numa_current() returns its own singleton, distinct from Host(), since it represents a thread-relative location rather than a fixed one. Construction routes through __new__ -> _get_or_create with a double-checked dict + Lock cache keyed on (numa_id, is_numa_current). __eq__ collapses to identity (consistent with the retained __hash__). __reduce__ added so pickled Host instances round-trip back through the singleton cache instead of stranding copies. Resolves PR NVIDIA#1775 review: leofang and Andy-Jost requested Host follow Device as a singleton so users can rely on `is` for identity checks.
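A minimal sketch of the interning scheme this commit describes. The cache key and method names (`_get_or_create`, the `(numa_id, is_numa_current)` tuple) follow the commit message; everything else is illustrative, not the real class:

```python
import threading


class Host:
    """Illustrative singleton with a __new__-based intern cache."""

    _instances = {}                       # ClassVar: key -> Host
    _instances_lock = threading.Lock()    # ClassVar
    __slots__ = ("_is_numa_current", "_numa_id")

    def __new__(cls, numa_id=None, *, _numa_current=False):
        return cls._get_or_create(numa_id, _numa_current)

    @classmethod
    def _get_or_create(cls, numa_id, is_numa_current):
        key = (numa_id, is_numa_current)
        inst = cls._instances.get(key)        # fast path: no lock taken
        if inst is None:
            with cls._instances_lock:         # slow path: double-checked
                inst = cls._instances.get(key)
                if inst is None:
                    inst = super().__new__(cls)
                    inst._numa_id = numa_id
                    inst._is_numa_current = is_numa_current
                    cls._instances[key] = inst
        return inst

    @classmethod
    def numa_current(cls):
        # Thread-relative location: its own singleton, distinct from Host().
        return cls(None, _numa_current=True)

    def __eq__(self, other):
        return self is other                  # equality collapses to identity

    __hash__ = object.__hash__                # retained, consistent with __eq__

    def __reduce__(self):
        # Pickled instances round-trip through the cache (no stranded copies).
        return (type(self)._get_or_create, (self._numa_id, self._is_numa_current))
```

With this shape, `Host() is Host()` and `Host(numa_id=1) is Host(numa_id=1)` hold, and `Host.numa_current()` is never the same object as `Host()`.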
Align with the graph module's AdjacencySetProxy: rename the class and inherit from collections.abc.MutableSet so the full set interface (remove, pop, clear, |=, &=, -=, ^=, isdisjoint, subset/superset operators, etc.) is filled in automatically from the existing add / discard / __contains__ / __iter__ / __len__ primitives. Add classmethod _from_iterable so binary set operators (&|^) produce plain sets rather than constructing a buffer-less proxy. Tighten add to TypeError on non-Device/Host inputs and discard / __contains__ to silently ignore them, matching MutableSet contracts. The hand-rolled __eq__ (set/frozenset comparison) is dropped: Set ABC's default implementation handles it correctly. Resolves PR NVIDIA#1775 review (Andy-Jost, 2026-05-04): naming consistency with AdjacencySetProxy and full MutableSet conformance.
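A self-contained sketch of the MutableSet approach, using a plain set and ints as stand-ins for the buffer-backed storage and Device/Host elements (class and storage names are illustrative):

```python
from collections.abc import MutableSet


class AccessSetProxy(MutableSet):
    """Five primitives in, the full set interface out, via the ABC."""

    def __init__(self, items=()):
        self._items = set(items)

    # --- the primitives MutableSet requires ---
    def __contains__(self, x):
        return x in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def add(self, x):
        if not isinstance(x, int):          # stand-in for the Device/Host check
            raise TypeError(f"expected int, got {type(x).__name__}")
        self._items.add(x)

    def discard(self, x):
        self._items.discard(x)              # silently ignores absent elements

    # Binary operators (& | ^ -) build plain sets, not buffer-less proxies.
    @classmethod
    def _from_iterable(cls, it):
        return set(it)
```

`remove`, `pop`, `clear`, the in-place operators, `isdisjoint`, and the subset/superset comparisons all come from the ABC, and `Set`'s default `__eq__` already compares correctly against built-in sets, which is why the hand-rolled one could be dropped.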
@Andy-Jost Done. Resolved by 7126324.
- Annotate _instances / _instances_lock as ClassVar (RUF012).
- Sort __slots__ alphabetically (RUF023, auto-fixed by ruff).
@leofang ping for a re-review. Your
@rparolin as discussed earlier offline, would it be possible to merge all P0 PRs first? I have a very long review backlog... Also, if @Andy-Jost has completed re-review, a new approval should be stamped to make it clear.
Yup. As discussed, this is not a priority. But the changes are ready for you to review whenever you are available.
rwgk left a comment
The PR title and description seem to be very out-of-date. It looks like it'll be best to discard and re-generate from scratch (with an independent fresh agent that isn't polluted by the history with many revisions).
For the title, how about:
Add managed-memory ManagedBuffer class with advise / prefetch / discard-prefetch APIs, plus batched free functions
I switched back to using Cursor GPT-5.4 Extra High Fast (after feeling very frustrated with the undisclosed other Cursor model I used before). GPT-5.4 Extra High Fast was thinking significantly longer, but also gives far more concise findings without any further prompting:

Findings
The Medium finding seems like the least actionable one; the other two seem worth drilling down into.
Doh! Thank you. I'll regenerate the description now. |
bool is an int subclass, so the previous guard let Host(True) and Host(False) seed the singleton cache under the same keys as Host(1) and Host(0). Whichever call landed first won, leaving repr(Host(1)) potentially showing as Host(numa_id=True). Reject bool explicitly. Addresses rwgk's Low finding on PR NVIDIA#1775.
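The trap and the fix can be shown in isolation (the helper name below is hypothetical, not the real validation path):

```python
def validate_numa_id(numa_id):
    """Reject bool explicitly: isinstance(True, int) is True, so a plain
    int check would let Host(True) share a cache key with Host(1)."""
    if numa_id is None:
        return None
    if isinstance(numa_id, bool) or not isinstance(numa_id, int):
        raise TypeError(f"numa_id must be an int, got {numa_id!r}")
    return numa_id
```

The `isinstance(numa_id, bool)` test must come first (or be or-ed in, as here), since `isinstance(True, int)` passes the plain check.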
Move _require_managed_buffer to the first statement of _advise_one so a non-managed buffer is rejected before advice/location parsing, matching the order in _do_single_prefetch_py and _do_single_discard_prefetch_py. This prevents surfacing an advice-validation error when the real problem is the buffer kind.
Rephrase the RuntimeError raised from _to_legacy_device when a caller passes Host(numa_id=...) or Host.numa_current() on a CUDA 12 build. The new message names the unsupported APIs and points the user at Host() as the working alternative, instead of leaking the internal location_type discriminator.
The CUDA 12 cuMemPrefetchAsync / cuMemAdvise ABI takes a plain device ordinal and cannot represent a specific host NUMA node. Previously _coerce_location accepted Host(numa_id=...) and Host.numa_current() on a CUDA 12 build and let the operation fail late inside the Cython layer with RuntimeError, which the public APIs surfaced as a confusing error from deep in the stack. Reject NUMA-host kinds at the call boundary in _coerce_location with a TypeError that names the unsupported APIs and points at Host() as the working alternative. Update the ManagedBuffer docstring to match the new contract, and broaden two host_numa-rejection test asserts to accept either the CUDA 13 kind-allowed ValueError or the CUDA 12 boundary TypeError. Addresses rwgk's Medium finding on PR NVIDIA#1775.
The previous setter computed (current - target) and (target - current)
and called _advise_one in two loops. set(locations) raised TypeError
on unhashable elements, but only after the first diff pair had already
been issued, so an invalid RHS could leave accessed_by partially
mutated. Reproduce: starting from {Device(0)}, assigning
{Host(numa_id=0)} on CUDA 12 raises and leaves accessed_by == set().
Validate every target up-front (per-element isinstance(Device|Host))
and only then issue the diff loops, so a bad RHS raises before any
driver state changes.
Addresses rwgk's High finding on PR NVIDIA#1775.
@rwgk re your High finding: Done. 1b66367. The setter now validates every RHS element before issuing any driver calls.
@rwgk re your Medium finding (CUDA 12 incoherence): Done. bcc056b.
@rwgk re your Low finding: Done. d0b6621.
Collapses multi-line string concats and conditions back to single lines under the project's line-length limit. No behavior change.
…m_advise_prefetch

# Conflicts:
#	cuda_core/docs/source/release/1.0.0-notes.rst
Host(numa_id=N) and Host.numa_current() require CUDA 13 bindings; the TestLocationCoerce passthroughs were missing the binding_version guard already used by test_preferred_location_roundtrip_host_numa.
Summary

Adds managed-memory range operations to `cuda.core`:

- `cuda.core.utils`: `advise`, `prefetch`, `discard`, `discard_prefetch`. Each accepts either a single `Buffer` or a sequence; N==1 dispatches to the per-range driver entry point and N>1 dispatches to the corresponding `cuMem*BatchAsync` (CUDA 13+).
- `Host`: new top-level singleton class symmetric to `Device`. `Host()` (any host), `Host(numa_id=N)`, `Host.numa_current()`. Same-argument constructions are interned (`Host() is Host()`). Used together with `Device` to express managed-memory locations.
- `ManagedBuffer`: `Buffer` subclass returned by `ManagedMemoryResource.allocate`. Exposes a Pythonic property-style advice API on top of the same free functions. Wrap an external managed pointer with `Buffer.from_handle(...)` (now a `@classmethod`, so `ManagedBuffer.from_handle(...)` returns a `ManagedBuffer`).

Closes #1332. Addresses the managed-memory portion of #1333 (P1: `cuMemPrefetchBatchAsync`, `cuMemDiscardBatchAsync`, `cuMemDiscardAndPrefetchBatchAsync`). The P0 `cuMemcpyBatchAsync` from #1333 is intentionally out of scope and tracked separately; the holistic batched-API contract this PR commits to is documented in #issuecomment-4355502334 so the upcoming `cuMemcpyBatchAsync` work can mirror it.

Public API

ManagedBuffer: property-style advice on managed allocations

`ManagedMemoryResource.allocate` returns a `ManagedBuffer` (a `Buffer` subclass). All ManagedBuffer-specific behavior is layered on top of the free functions, so the two surfaces stay consistent.

Free functions: `advise` / `prefetch` / `discard` / `discard_prefetch`

Each accepts a `Buffer` (or `ManagedBuffer`) or a sequence of them. Locations are expressed via `Device` or `Host`.

Batched form: same function, sequence of targets

When N>1, dispatch goes to the corresponding `cuMem*BatchAsync`. Sequence locations are paired by index; a scalar location broadcasts to every target. Mismatched sequence lengths raise `ValueError`. On a CUDA 12 build of `cuda.core`, N>1 raises `NotImplementedError` (the `*BatchAsync` entry points are CUDA 13+); N==1 works on every supported toolkit.

Putting it together
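The pairing/broadcast rule for the batched form can be sketched in plain Python (the helper name is hypothetical; real dispatch happens in the Cython layer):

```python
def pair_targets_with_locations(targets, locations):
    """A scalar location broadcasts to every target; a sequence is paired
    by index and must match the target count exactly."""
    targets = list(targets)
    if isinstance(locations, (list, tuple)):
        if len(locations) != len(targets):
            raise ValueError(
                f"{len(locations)} locations for {len(targets)} targets"
            )
        return list(zip(targets, locations))
    return [(t, locations) for t in targets]  # scalar broadcast
```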
Implementation notes

- `cuda_core/cuda/core/_memory/_managed_memory_ops.pyx` uses `cimport cydriver` for direct C-level driver calls. The CUDA 12/13 signature difference for `cuMemAdvise` and `cuMemPrefetchAsync` is handled at compile time with `IF CUDA_CORE_BUILD_MAJOR >= 13:` / `ELSE:`.
- The batch entry points (`cuMemPrefetchBatchAsync`, `cuMemDiscardBatchAsync`, `cuMemDiscardAndPrefetchBatchAsync`) are CUDA 13+ only. On CUDA 12 builds, N>1 calls raise `NotImplementedError`; single-buffer calls work everywhere.
- `Host` is a singleton with `__slots__` and a `__new__`-based intern cache keyed by `(numa_id, is_numa_current)`. Same-argument constructions return the same instance on both Python and Cython call paths.
- `ManagedBuffer` is a pure-Python subclass of the Cython `Buffer` cdef class. `Buffer.from_handle` is now a `@classmethod` (was `@staticmethod`) so `MyBufferSubclass.from_handle(...)` returns the typed instance via `cls._init`. `Buffer._from_deviceptr_handle` and `_MP_allocate` thread an optional `cls` parameter so `ManagedMemoryResource.allocate` materializes a `ManagedBuffer`.
- `_LocSpec` (in `_managed_location.py`) carries the `(kind, id)` discriminator that the Cython layer maps to `CUmemLocation` (CUDA 13) or a legacy device ordinal (CUDA 12). Public callers see only `Device` / `Host`; `_coerce_location` produces the internal record.
- `_buffer.pyx` collapses `out.is_managed = (is_managed != 0)` to a single unconditional assignment and adds a TODO noting that HMM/ATS-mapped sysmem is not yet captured by `CU_POINTER_ATTRIBUTE_IS_MANAGED`.